The CMU Submission for the Shared Task on Language Identification in Code-Switched Data
Abstract
We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish–English, Mandarin–English, Nepali–English, and Modern Standard Arabic–Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: semi-supervised learning, word embeddings, and word lists.
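To make the setup concrete, here is a minimal sketch of the kind of per-token feature extraction a CRF-based tagger might consume, including word-list membership features. It is an illustration under stated assumptions, not the authors' feature set: the function names, the feature names, and the toy `WORD_LISTS` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy word lists (assumption); a real system would load far larger ones.
WORD_LISTS: Dict[str, Set[str]] = {
    "en": {"the", "is", "very", "good"},
    "es": {"la", "es", "muy", "casa"},
}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i], suitable as input to a sequence labeler."""
    word = tokens[i]
    lower = word.lower()
    feats: Dict[str, object] = {
        "lower": lower,
        "is_title": word.istitle(),
        "has_digit": any(c.isdigit() for c in word),
        # Character prefixes and suffixes are cheap cues to a word's language.
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        # Neighboring tokens give the labeler local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # One binary feature per word list: does the token appear in it?
    for lang, words in WORD_LISTS.items():
        feats[f"in_{lang}_list"] = lower in words
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Extract features for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Calling `sentence_features(["La", "casa", "is", "very", "nice"])` yields one feature dict per token. A CRF toolkit that accepts dict-valued features could be trained on such sequences paired with per-token language labels.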
Similar resources
Overview for the First Shared Task on Language Identification in Code-Switched Data
We present an overview of the first shared task on language identification on code-switched data. The shared task included code-switched data from four language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA), Mandarin-English (MAN-EN), Nepali-English (NEP-EN), and Spanish-English (SPA-EN). A total of seven teams participated in the task and submitted 42 system runs. The evaluation showed t...
A Neural Model for Language Identification in Code-Switched Tweets
Language identification systems suffer when working with short texts or in domains with unconventional spelling, such as Twitter or other social media. These challenges are explored in a shared task for Language Identification in Code-Switched Data (LICS 2016). We apply a hierarchical neural model to this task, learning character and contextualized word-level representations to make word-level ...
Overview for the Second Shared Task on Language Identification in Code-Switched Data
We present an overview of the second shared task on language identification in code-switched data. For the shared task, we had code-switched data from two different language pairs: Modern Standard Arabic-Dialectal Arabic (MSA-DA) and Spanish-English (SPA-ENG). We had a total of nine participating teams, with all teams submitting a system for SPA-ENG and four submitting for MSA-DA. Through evaluati...
Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet
The paper outlines a supervised approach to language identification in code-switched data, framing this as a sequence labeling task where the label of each token is identified using a classifier based on Conditional Random Fields and trained on a range of different features, extracted both from the training data and by using information from Babelnet and Babelfy. The method was tested on the de...
Columbia-Jadavpur submission for EMNLP 2016 Code-Switching Workshop Shared Task: System description
We describe our present system for language identification as a part of the EMNLP 2016 Shared Task. We were provided with the Spanish-English corpus composed of tweets. We have employed a predictor-corrector algorithm to accomplish the goals of this shared task and analyzed the results obtained.